Student Projects in Language and Speech Technology

Dan Cristea♥♣, Horia-Nicolai Teodorescu♦♣, Dan-Ioan Tufiş♠♥
dcristea@infoiasi.ro, hteodor@etc.tuiasi.ro, tufis@racai.ro

♥ University “Al. I. Cuza” of Iaşi, Faculty of Computer Science
♦ Technical University of Iaşi, Faculty of Electronics and Telecommunications
♣ Romanian Academy – the Iaşi Branch, Institute for Theoretical Computer Science
♠ Romanian Academy, Research Institute for Artificial Intelligence

Abstract

The paper reports on term homework and projects associated with the Natural Language Processing courses delivered at the Faculty of Computer Science of the “Al. I. Cuza” University of Iaşi (UAIC-FII), both to undergraduate students in Computer Science and to graduate students enrolled in the Master Programme in Computational Linguistics, during the university year 2003-2004. All projects make heavy use of language resources, in either written or spoken form.

1 Introduction

At UAIC-FII, two categories of courses are taught in different areas of natural language and speech processing. At the undergraduate level, terminal-year students can take an elective course on natural language processing (which mainly presents theories and techniques for discourse-level representation and processing of language). At the master level, the students in Computational Linguistics take, over the two-year programme, a general introductory course on computational linguistics, a theoretical course on syntax, a course on corpus linguistics, one on lexical semantics, one on machine translation and one on speech processing. The laboratory activities of both undergraduate and graduate courses make heavy use of corpora and other linguistic resources. In these activities students usually work in teams on term projects and homework. The goal of all projects is threefold: a) to train students to exploit existing corpora through the program interfaces available on the web, by integrating function calls in their own applications; b) to train students to make use of annotations applied to texts, by first manually annotating a small corpus and then devising their own tools that exploit the annotation for NLP applications; c) to build small annotated corpora, including spoken language (speech) corpora, and appropriate exploitation software that remain in the Faculty for further NLP research.

2 Undergraduate projects

One category of undergraduate projects dealt with Textual resources: acquisition and exploitation. All projects were group projects, with 4-5 students each¹.

¹ The projects are posted (in Romanian) at the following URL: http://thor.info.uaic.ro/~dcristea/cursuri/LC/lcproiecte.htm

Lattice of XML annotation standards. The students working on this project had to propose a technique to transform an XML annotation standard into a node of a directed acyclic graph. Details on this project can be found in (Cristea and Butnariu, 2004). A node of the hierarchy records a set of tag names (XML element tags), their corresponding attributes, and possible semantic relations between attributes. Any node inherits all features of its parents in the hierarchy. A node (standard) A is said to subsume a node (standard) B in the hierarchy (therefore B is a descendant of A) if and only if:
- any tag-name of A is also in B;
- any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B;
- any semantic relation which holds in A also holds in B;
- either B has at least one tag-name which is not in A, and/or there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the homonymous tag-name in A, and/or there is at least one semantic relation which holds in B and which does not hold in A.

As such, a hierarchical relation between a node A and one of its descendants B describes B as an annotation standard which is more informative than A and/or defines more semantic constraints. Given a lattice of annotation standards of this kind and a collection of documents annotated according to these standards, all having the same hub document (the original, empty-annotation document), a set of operations can be defined. The hierarchy itself can be built either by explicit definitions or by classifying a set of documents obeying different standards. Then, on such a hierarchy and a corresponding collection of documents, merge and extract operations can be defined. A merge operation combines two documents having identical hubs and corresponding to two distinct nodes of the hierarchy, and produces the document containing the union of the annotation tags of the two original documents, together with the corresponding standard in the hierarchy. The resulting standard is then placed in the hierarchy as a common daughter of the original document standards. An extract operation applies the reverse transformation: from a document which corresponds to a certain node in the hierarchy, it extracts a document conforming to one of the node’s ascendants in the hierarchy. Concurrent-checks could also be defined. Two annotations are called concurrent if they intend to represent the same linguistic phenomenon from different perspectives, therefore possibly resulting in different solutions. Viewed within the frame of the hierarchical graph representation, concurrent documents cannot be merged into a single file with a corresponding standard, because the resulting annotation would contain crossing markings.

The project, realized in Java-DOM, was intended to train students’ abilities to model and realize complex applications dealing with annotation standards. Beyond its didactic benefit, the resulting tool has already proved extremely useful for the corpus-driven research developed in our NLP laboratory.

Semantic structures for automatic translation. This was a large project aimed at training students in techniques of automatic translation. Ten groups of students, 4 members each, received the English-Romanian aligned parallel corpus of George Orwell’s “1984”, initially tagged for part-of-speech in both languages. Students had to recognize senses of words and to annotate these senses according to two aligned lexical thesauri, the English WordNet (Fellbaum, 1998), as part of the EuroWordNet project (Vossen, 1997), and the Romanian WordNet (RoWN) (Tufiş and Cristea, 2002), as part of the BalkaNet project (Tufiş, Cristea and Stamou, 2004), and to build parallel semantic frames of translation-equivalent verbs. More precisely, their task was:
- to find all verb occurrences in English and to sort them in descending order of their frequency;
- among the most frequent verbs, each group had to choose 10 English verbs and to select from the parallel corpus all the language-pair segments containing their occurrences;
- they had to annotate the occurrences of these verbs, in both English and Romanian, with senses (Inter-Lingual Index codes), according to EuroWordNet and RoWN;
- then the subcategorisation constituents of the verbs had to be annotated: their syntactic role, the head word, and the sense of the head word, again using the EuroWordNet ILI codes;
- then, students had to select all occurrences in which a verb was considered to have the same sense and to generalize a semantic frame out of the set of constituents found around it. For a given constituent, say the direct object role, the generalization had to be the lowest concept in the wordnet hierarchy subsuming all senses of the head words found in the role of direct object in the selected examples. If no generalization of this kind could be found (because, for each part-of-speech, wordnet contains a collection of graphs, not just one), the union of the lowest role-concepts that could be computed was taken;
- the final goal was to report a collection of English-Romanian frames around verbs that have given rise to parallel translations, which could be considered the kernel of a semantic transfer grammar.

Although the results were rather unequal, from spectacularly good to poor, overall the project was successful: after finishing it, most students had a very clear sense of the advantages of using annotated corpora in NLP, and they had learned the technology to obtain annotated corpora and to exploit them. Moreover, the best rated projects have sown the seeds for further master-level research.

Acquisition of a corpus of Romanian texts. All students had to find on the Web, collect, and annotate according to the PAROLE schema (Villegas et al., 2000) (document header and paragraph level annotation) a corpus of Romanian texts summing up to one million words. One group was responsible for building an interface to facilitate the uploading of individual corpora, the validation of the PAROLE headings and the filtering of uploads, so that only new, PAROLE-conformant texts are accepted. The uploading site is http://www3.infoiasi.ro/~toni/lingvistica/index.php. Although perceived as tedious by most students, except those who had to build the uploading and filtering interface, the theme raised interest in a concentrated, web-search group activity that can result in the acquisition of a consistent linguistic resource, extremely useful for NLP research.

3 Postgraduate projects

The following series of projects was given to first- and second-year master students in Computational Linguistics.

Assembling parts of a wordnet. The project dealt with the Romanian wordnet, part of the BalkaNet multilingual wordnet, which uses the Princeton WordNet (PWN) as an inter-lingual index. After a detailed presentation of PWN and of the multilingual architecture of the BalkaNet project, the students were trained in using the WNBuilder acquisition tool. WNBuilder (Tufiş and Barbu, 2004) is a user-friendly interface integrating various language resources for Romanian and English (Explanatory Dictionary, Dictionary of Synonyms, Princeton WordNet, Romanian-English translation dictionary) and allowing for collaborative work in defining synsets and linking them to the counterpart synsets in PWN. Each student received a distinct set of English synsets extracted from PWN and had to construct the corresponding Romanian synset for every synset in that set. Building a synset involved identifying a synonymy list from the Romanian Dictionary of Synonyms (RDS), assigning sense numbers to each literal based on the numbering of senses in the Explanatory Dictionary of Romanian (EDR), choosing from EDR the most adequate definition for the synset and, finally, establishing the interlingual relation with the starting synset from PWN. There are various interlingual relations, as defined in EuroWordNet (Vossen, 1997): EQ-SYN, EQ-HYPERONYM, EQ-HYPONYM, EQ-MERO, EQ-HOLO. The students had to overcome (and explain their solutions to) various difficulties arising from: different granularities of the underlying dictionaries (PWN on one side, RDS and EDR on the other), lexical gaps, missing senses in EDR, splitting conjunctive definitions, etc.

A special tool, WNCorrect (Tufiş and Barbu, 2004), was used to evaluate and correct each student’s synsets. The evaluation tool provided detailed reports on the syntactic and semantic errors, which allowed an objective assessment of each student’s work.

Sense disambiguation. To further refine the students’ understanding of lexical semantics issues, in a second task they were engaged in a word-sense disambiguation exercise. Each student was given a set of English sentences containing several occurrences of different target words. The students had to semantically disambiguate all the targeted words by choosing the sense numbers from PWN. The context for the sense disambiguation exercise was defined as the sentence containing the targeted word. Additionally, each student had the possibility to add comments whenever in doubt about the appropriate sense assignment. The comments indicated, among other things, insufficient context and too fine-grained sense distinctions.

The set of English targeted words was extracted from the parallel corpus “1984” so that all their senses (at least two per part-of-speech) defined in PWN were also implemented (and interlingually aligned) in the RoWN. This resulted in 211 words with 1832 occurrences. An extraction script generated, for each student, a set of sentences containing occurrences of the targeted words. The extraction process ensured that the same sentence appeared in at least three student sets. Therefore, in the end, each occurrence of a targeted word was sense disambiguated by at least three students. The same targeted words were automatically disambiguated by the WSDtool system (Tufiş et al., 2004a; Tufiş, Ion and Ide, 2004b). WSDtool exploits the multilingual wordnets in BalkaNet (in this case the pair RoWN and PWN) and builds on the TREQ-AL word-alignment program (Tufiş, Barbu and Ion, 2003). A previous evaluation of the WSDtool performance on the same data had shown high accuracy (>85%) and, as a result of that evaluation, we were able to construct a gold standard against which the students’ assignments were evaluated.

The evaluation files contained detailed information for each occurrence of the targeted word:
- the name of the student who evaluated the occurrence and the sense he/she assigned;
- the comments the student had on the sense assignment, if any;
- the sense in the gold standard;
- a majority-voting sense number resulting from the students’ sense assignments.

Speech processing. A second category of master student projects exploited speech data in correlation with the course Speech Processing (Analysis, Recognition, and Synthesis). The goal of this project was manifold:
- to help students better understand speech signal processing and, more broadly, speech technology;
- to help students better understand the meaning of prosody and its characteristics;
- to help students understand speech production;
- to help students understand inter-speaker and intra-speaker speech variability;
- to help students master the technical tools for analyzing voice signals;
- to help students understand formant and concatenative synthesis, their principles, and their relative advantages and disadvantages;
- to train students in acquiring small collections of speech data.

We believe that all these goals structure the knowledge needed to build sound speech corpora.

We have two years of experience in teaching this course. It is important to say that the class has been, during both years, quite heterogeneous. Classes included about 50% students with a background in linguistics, about 25% students who graduated in computer science (informatics), and about 25% students with other backgrounds (including such varied studies as philosophy). Figure 1 illustrates some results obtained by a student.

Figure 1: Example of sonogram of a phrase, annotated at the sentence, word and phoneme levels in a student project. The red curve represents the trajectory of the pitch and is used to determine intonation (prosody).

Special emphasis has been put on hands-on work by students in the laboratory and during their supervised work on a mini-project. The mini-project has been tailored to encompass a large section of the theory covered in the class. Namely, the project asked students to complete the following steps:
1. make several recordings of their own voice, uttered with various intonations and under various experimental conditions;
2. perform a basic analysis of the uttered vowels, words and phrases: Fourier spectra, sonograms, estimating the formant traces and determining the pitch evolution;
3. draw the “vowel triangle”;
4. view the waveforms and determine their basic properties (amplitude, number of zero-crossings, periodicity, period);
5. study the stationarity/non-stationarity aspects of the waveform;
6. study the influence the recording conditions have on the waveform and on the spectrograms, chiefly on the amplitude envelope, pitch and formants;
7. segment sentences, words, syllables, and phonemes;
8. annotate the spectrograms for the above;
9. perform a basic statistical analysis of the formant frequencies for vowels;
10. synthesize the vowels, using a Klatt voice synthesizer (Jitca et al., 2002), and assess their quality.

Moreover, comments on the formants and the corresponding time series have been required for higher grades. The results of their activities had to be saved in speech files annotated with information that indicates word, syllable and phoneme boundaries.

4 Conclusions

The project on XML standards resulted in the development of a handy and very useful interface for exploiting richly annotated XML-based corpora. We intend to apply the methodology of hierarchical organisation of standards, developed as part of the project, and the resulting interface to further class projects in CL and to the development of our own resources (mainly acquired by master students in CL).

Many students did not find the semantic structures project easy. Most of the groups accomplished the sense annotation of verbs, but fewer also completed the sense annotation of constituents. Only two groups accomplished all tasks on the specification list, realizing the sense generalization programs and assembling the parallel frames.

Although initially we had no great expectations with respect to the outcomes of the individual corpus acquisition tasks, the initiative resulted in the acquisition of an opportunistic corpus of about 82 million Romanian words. This will be integrated into the resources for the Romanian language as part of the activity of the Consortium for the Informatization of the Romanian Language (http://consilr.info.uaic.ro/~pic/).

At the time of this writing, we have not yet finished analysing the output of the project on wordnet and sense disambiguation, but the results will soon be available on the Internet. The findings of this investigation will be discussed during the workshop.

The project in speech technology offered students the opportunity to grasp the knowledge and skills needed in building spoken language corpora. Most students showed great interest in the project after they had completed about 25-40% of it. However, the initial work was painful for some of them, because of their limited knowledge in one or several of the fields that the project involves (signal processing theory, phonetics, basic human physiology, or statistical analysis and pattern analysis). The best students were encouraged to continue this class project as research projects. One of them successfully contributed, as an assistant research team member, to research funded by a national grant. Others have made small contributions to the analysis of a phonologic thesaurus (created by the Romanian Academy) and have become familiar with the generation of linguistic atlases for the Romanian language. It should be emphasized that such studies are also continued at the doctoral level, where they aim to deal with more intricate subjects, such as nonlinear speech analysis (Rodriguez et al., 2000; Grigoraş, Teodorescu and Apopei, 1998), starting from deeply annotated speech corpora.

Overall, the projects raised the students’ interest in corpus-based applications related to NLP. Many of the projects, perceived by students as rather complex, were also intended to challenge students towards research activities. For instance, the two teams that performed best in the semantic structures for translation project, and that reached the expected final level, reported that the project exceeded in complexity any of the projects they had had to accomplish over their university years, but also helped them to understand more intimately the methodology of a real linguistic research activity. We believe that terminal-year undergraduate students and master students, when faced with activities of a high degree of complexity, can be stimulated to join real research projects at master or doctoral level.

References

Cristea, D. and Butnariu, C. (2004) Hierarchical XML representation for heavily annotated corpora, in Proceedings of the LREC 2004 Workshop on XML-Based Richly Annotated Corpora, Lisbon, Portugal.

Fellbaum, C. (1998) WordNet: An Electronic Lexical Database. The MIT Press.

Grigoraş, Fl., Teodorescu, H. N., Apopei, V. (1998) Nonlinear Analysis and Synthesis of Speech, in Studies in Informatics and Control, vol. 7, no. 1, March.

Jitca, D., Teodorescu, H. N., Apopei, V., Grigoraş, Fl. (2002) Improved Speech Synthesis Using Fuzzy Methods, in Int. J. Speech Technology (Kluwer), September, Volume 5, Issue 3.

Rodriguez, W., Teodorescu, H. N., Grigoraş, Fl., Kandel, A., Bunke, H. (2000) A Fuzzy Information Space Approach to Speech Signal Non-Linear Analysis, in International Journal of Intelligent Systems (Wiley), vol. 15, no. 4, April.

Stamou, S., Oflazer, K., Pala, K., Christoudoulakis, D., Cristea, D., Tufiş, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M. (2002) BALKANET: A Multilingual Semantic Network for the Balkan Languages, in Proceedings of the International Wordnet Conference, Mysore, India.

Tufiş, D., Cristea, D. (2002) Methodological issues in building the Romanian Wordnet and consistency checks in Balkanet, in Proceedings of the Workshop on Wordnet Structures and Standardization, and how these affect Wordnet Applications and Evaluation, workshop in conjunction with The Third International Conference on Language Resources and Evaluation, LREC-2002, 28-31 May, Las Palmas, Spain.

Tufiş, D., Barbu, A. M., Ion, R. (2003) A word-alignment system with limited language resources, in Proceedings of the NAACL 2003 Workshop on Building and Using Parallel Texts; Romanian-English Shared Task, Edmonton, Canada.

Tufiş, D., Barbu, E. (2004) A Methodology and Associated Tools for Building Interlingual Wordnets, in Proceedings of LREC2004, Lisbon, Portugal.

Tufiş, D., Ion, R., Barbu, E., Barbu, V. (2004a) Cross-Lingual Validation of Multilingual Wordnets, in Proceedings of Global Wordnet Conference, Brno, Czech Republic.

Tufiş, D., Ion, R., Ide, N. (2004b) Word sense disambiguation as a wordnets validation method in Balkanet, in Proceedings of LREC2004, Lisbon, Portugal.

Tufiş, D., Cristea, D., Stamou, S. (2004, forthcoming) BalkaNet: Aims, Methods, Results and Perspectives. A General Overview, to appear in Romanian Journal on Science and Technology of Information, Romanian Academy, Bucharest, Romania.

Villegas, M., Bel, N., Lenci, A., Calzolari, N., Cataldo, G., Zampolli, A., Sadurni, T., Soler i Bou, J. (2000) Multilingual Linguistic Resources: From Monolingual Lexicons to Bilingual Interrelated Lexicons, in Proceedings of the LREC 2000 2nd International Conference on Language Resources & Evaluation, Athens, Greece.

Vossen, P. (1997) EuroWordNet: a multilingual database for information retrieval, in Proceedings of the DELOS workshop on Cross-language Information Retrieval, March 5-7, Zurich, Switzerland.
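
Illustration for Section 2 (Lattice of XML annotation standards). The four subsumption conditions and the standard-level side of the merge operation can be sketched as follows. This is only a minimal sketch in Python, not the students' Java-DOM implementation: the `Standard` class, its field names and the sample tag sets are assumptions made for illustration, and the document-level part of merge (combining the annotation tags over a shared hub document) is not modelled.

```python
from dataclasses import dataclass

@dataclass
class Standard:
    """Hypothetical model of an annotation standard (one node of the hierarchy)."""
    tags: dict                          # tag name -> frozenset of attribute names
    relations: frozenset = frozenset()  # semantic relations, e.g. tuples of names

def includes(a, b):
    """Conditions 1-3: every tag, attribute and relation of a is also in b."""
    tags_ok = all(tag in b.tags and attrs <= b.tags[tag]
                  for tag, attrs in a.tags.items())
    return tags_ok and a.relations <= b.relations

def subsumes(a, b):
    """A subsumes B: B contains all of A (conditions 1-3) and at least one
    extra tag, attribute or relation (condition 4)."""
    return includes(a, b) and not includes(b, a)

def merge(a, b):
    """Standard of a merged document: union of tags, attributes and relations."""
    names = a.tags.keys() | b.tags.keys()
    tags = {t: a.tags.get(t, frozenset()) | b.tags.get(t, frozenset())
            for t in names}
    return Standard(tags, a.relations | b.relations)

# Hypothetical standards: the empty hub, sentence marking, word marking.
hub = Standard({})
sentences = Standard({"s": frozenset({"id"})})
words = Standard({"w": frozenset({"id", "pos"})})
merged = merge(sentences, words)
```

With these sample standards, `merged` is subsumed by both `sentences` and `words`, which corresponds to placing the merged standard as a common daughter of the two original standards.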
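
Illustration for Section 2 (Semantic structures for automatic translation). The generalization step, finding the lowest concept that subsumes all head-word senses of a role and falling back to a union of concepts when the senses live in separate graphs, might look like the sketch below. The hypernym table is a toy, hypothetical hierarchy rather than actual WordNet data, and the grouping strategy used for the fallback is an assumption; the paper does not specify how the students implemented it.

```python
def ancestors(concept, hypernyms):
    """Return the concept together with all of its hypernym ancestors."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hypernyms.get(current, ()))
    return seen

def lowest_common_subsumers(concepts, hypernyms):
    """Common ancestors of all concepts that have no common ancestor below them."""
    common = set.intersection(*(ancestors(c, hypernyms) for c in concepts))
    return {c for c in common
            if not any(c in ancestors(d, hypernyms) for d in common if d != c)}

def generalize(concepts, hypernyms):
    """Lowest subsuming concept(s); a union over groups when no single one exists."""
    groups = []  # each group holds concepts that share at least one ancestor
    for concept in concepts:
        for group in groups:
            shared = set.intersection(
                *(ancestors(x, hypernyms) for x in group + [concept]))
            if shared:
                group.append(concept)
                break
        else:
            groups.append([concept])
    result = set()
    for group in groups:
        result |= lowest_common_subsumers(group, hypernyms)
    return result

# Toy hierarchy (hypothetical): concept -> list of direct hypernyms.
HYPERNYMS = {
    "dog": ["canine"], "canine": ["animal"],
    "cat": ["feline"], "feline": ["animal"],
    "animal": ["entity"],
    "letter": ["document"],  # lives in a separate graph
}

same_graph = generalize(["dog", "cat"], HYPERNYMS)
two_graphs = generalize(["dog", "cat", "letter"], HYPERNYMS)
```

In the first call a single generalization exists; in the second, the result is the union of one concept per disconnected graph, mirroring the fallback described in the task list.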
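
Illustration for Section 3 (Sense disambiguation). The evaluation files combine, for each occurrence, the students' sense assignments, the gold-standard sense and a majority-voting sense. A minimal sketch of how such a record could be processed is given below. The record layout, the student identifiers and the sense numbers are invented for illustration, and returning `None` on a tied vote is an assumption, since the paper does not say how ties were handled.

```python
from collections import Counter

def majority_sense(assignments):
    """Most frequent sense among the assignments; None if the top count is tied."""
    counts = Counter(assignments).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def student_accuracy(records):
    """Fraction of each student's assignments that match the gold-standard sense."""
    hits, totals = Counter(), Counter()
    for record in records:
        for student, sense in record["assigned"].items():
            totals[student] += 1
            hits[student] += sense == record["gold"]
    return {student: hits[student] / totals[student] for student in totals}

# Hypothetical evaluation records: one per occurrence of a targeted word.
records = [
    {"word": "w1", "gold": 2, "assigned": {"s1": 2, "s2": 2, "s3": 1}},
    {"word": "w2", "gold": 1, "assigned": {"s1": 1, "s2": 3, "s3": 2}},
]

votes = [majority_sense(list(r["assigned"].values())) for r in records]
accuracy = student_accuracy(records)
```

Comparing `votes` with the gold senses separates occurrences where the students agree with the gold standard from those where the votes are split, which is where the students' comments on insufficient context or overly fine sense distinctions would be most informative.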
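
Illustration for Section 3 (Speech processing, step 4 of the mini-project). The basic waveform properties named there (amplitude, number of zero-crossings, period) can be computed on a sampled signal as sketched below. This is a sketch under stated assumptions: the signal is a synthetic 100 Hz tone sampled at 8000 Hz rather than a student recording, and the period is estimated with a plain autocorrelation search, which is one common choice and not necessarily the tool the students used.

```python
import math

def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum((a < 0) != (b < 0) for a, b in zip(samples, samples[1:]))

def peak_amplitude(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

def estimate_period(samples, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation in [min_lag, max_lag]."""
    def autocorrelation(lag):
        return sum(samples[n] * samples[n + lag]
                   for n in range(len(samples) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorrelation)

# Synthetic test signal (assumed values): 0.1 s of a 100 Hz sine at 8000 Hz.
SAMPLE_RATE = 8000
TONE_HZ = 100
signal = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(800)]

crossings = zero_crossings(signal)
amplitude = peak_amplitude(signal)
period = estimate_period(signal, 20, 400)  # search lags from 2.5 ms to 50 ms
pitch_hz = SAMPLE_RATE / period
```

On real voiced speech the same functions would be applied frame by frame, since the waveform is only approximately periodic over short intervals, which is the stationarity question raised in step 5.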